Learning Ranking Functions Incorporating Boosted Ranking In A Regression Framework For Information Retrieval And Ranking

ABSTRACT

Embodiments of the present invention provide for methods, systems and computer program products for learning ranking functions to determine the ranking of one or more content items that are responsive to a query. The present invention includes generating one or more training sets comprising one or more content item-query pairs, determining preference data for the one or more query-content item pairs of the one or more training sets and determining labeled data for the one or more query-content item pairs of the one or more training sets. A ranking function is determined based upon the preference data and the labeled data for the one or more content-item query pairs of the one or more training sets. The ranking function is then stored for application to query-content item pairs not contained in the one or more training sets.

COPYRIGHT NOTICE

A portion of the disclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the Patent and Trademark Office patent files or records, but otherwise reserves all copyright rights whatsoever.

FIELD OF THE INVENTION

The invention disclosed herein relates generally to information retrieval and ranking. More specifically, the present invention relates to systems, methods and computer program products for the determination of learning ranking functions that incorporate boosted ranking to determine the ranking of one or more content items that are responsive to a query.

BACKGROUND OF THE INVENTION

The World Wide Web provides access to an extraordinarily large collection of information sources (in various formats including text, images, videos and other media content) relating to virtually every subject imaginable. As the World Wide Web has grown, the ability of users to search this collection of information and identify content relevant to a particular subject has become increasingly important.

A user of a search engine, for example, typically supplies a query to the search engine that contains only a few terms and expects the search engine to return a result set comprising relevant content items. Although a search engine may return a result set comprising hundreds of relevant content items, most users are likely to only view the top several content items in a result set. Thus, to be useful to a user, a search engine should determine those content items in a given result set that are most relevant to the user, or that the user would be most interested in, on the basis of the query that the user submits and rank such content items accordingly.

A user's view as to which content items are relevant to the query is influenced by a number of factors, many of which are highly subjective. Due to the highly subjective nature of such factors, it is generally difficult to capture in an algorithmic set of rules those factors that define a function for ranking content items. Furthermore, these subjective factors may change over time, as for example when current events are associated with a particular query term. Thus, users who receive search result sets that contain results not perceived to be highly relevant quickly become frustrated and may potentially abandon the use of a particular search engine. Therefore, designing an effective and efficient function that is operative to retrieve and efficiently rank content items is of the utmost importance to information retrieval.

SUMMARY OF THE INVENTION

Embodiments of the present invention provide for methods, systems and computer program products for learning ranking functions to determine the ranking of one or more content items that are responsive to a query. The present invention includes generating one or more training sets comprising one or more content item-query pairs, determining preference data for the one or more query-content item pairs of the one or more training sets and determining labeled data for the one or more query-content item pairs of the one or more training sets. A ranking function is determined based upon the preference data and the labeled data for the one or more content-item query pairs of the one or more training sets. The ranking function is then stored for application to query-content item pairs not contained in the one or more training sets.

BRIEF DESCRIPTION OF THE DRAWINGS

The invention is illustrated in the figures of the accompanying drawings which are meant to be exemplary and not limiting, in which like references are intended to refer to like or corresponding parts, and in which:

FIG. 1 illustrates a block diagram of a system for learning ranking functions that incorporate boosted ranking in a regression framework to determine the ranking of one or more content items that are responsive to a query according to one embodiment of the present invention;

FIG. 2 illustrates a flow diagram presenting a method for learning ranking functions that incorporate boosted ranking in a regression framework to determine the ranking of one or more content items that are responsive to a query according to one embodiment of the present invention; and

FIG. 3 presents one embodiment of a method for using a learned ranking function to order a result set that the search engine deems responsive to a given query according to one embodiment of the present invention.

DETAILED DESCRIPTION OF THE INVENTION

In the following description of the embodiments of the invention, reference is made to the accompanying drawings that form a part hereof, and in which is shown by way of illustration, exemplary embodiments in which the invention may be practiced. It is to be understood that other embodiments may be utilized and structural changes may be made without departing from the scope of the present invention.

FIG. 1 illustrates one embodiment of a system 100 for learning ranking functions that incorporate boosted ranking in a regression framework to determine the ranking of one or more content items that are responsive to a query. System 100 includes one or more clients 110, a computer network 120, one or more content providers 130, and a search provider 140. The search provider 140 comprises a search engine 150, an indexing component 160, an index data store 170, a ranking engine 180 and a persistent data store 190.

The computer network 120 may be any type of computerized network capable of transferring data, such as the Internet. According to one embodiment of the invention, a given client device 110 is a general purpose personal computer comprising a processor, transient and persistent storage devices, input/output subsystem and bus to provide a communications path between components comprising the general purpose personal computer. For example, the client device 110 may be a 3.5 GHz Pentium 4 personal computer with 512 MB of RAM, 40 GB of hard drive storage space and an Ethernet interface to a network. Other client devices are considered to fall within the scope of the present invention including, but not limited to, hand held devices, set top terminals, mobile handsets, PDAs, etc. The present invention is not limited to only a single client device 110 and may comprise additional, disparate client devices. The client device 110 is therefore presented for illustrative purposes representative of multiple client devices.

According to one embodiment of the invention, a given content provider 130 and the search provider 140 are programmable processor-based computer devices that include persistent and transient memory, as well as one or more network connection ports and associated hardware for transmitting and receiving data on the network 120. The content provider 130 and the search provider 140 may host websites, store data, serve ads, etc. The present invention is not limited to only a single content provider 130 and may comprise additional, disparate content providers. The content provider 130 is therefore presented for illustrative purposes representative of multiple content providers. Those of skill in the art understand that any number and type of content provider 130, search provider 140 and client device 110 may be connected to the network 120.

The search engine 150, the indexing component 160 and the ranking engine 180 may comprise one or more processing elements operative to perform processing operations in response to executable instructions, collectively as a single element or as various processing modules, which may be physically or logically disparate elements. The index data store 170 and the persistent data store 190 may be one or more data storage devices of any suitable type, operative to store corresponding data therein. Those of skill in the art recognize that the search provider 140 may utilize more or fewer components and data stores, which may be local or remote with regard to a given component or data store.

In accordance with one embodiment, the client device 110, the search provider 140 and the content provider 130 are communicatively coupled to the computer network 120. Using the network 120, the search provider 140 is capable of accessing the content provider 130, which hosts content items a user may wish to locate through use of the search engine 150 at the search provider 140. The search provider 140 may communicate with the content provider 130 for maintaining cached copies at the persistent data store 190 of content items that the content provider 130 hosts. The collection of content items, as well as information regarding content items, is referred to as “crawling,” and is the process by which the search provider 140 collects information upon which the search engine 150 performs searches.

The search provider 140 crawls one or more content providers 130 that are in communication with the network 120, which may comprise collecting combinations of content items and information regarding the same. The indexing component 160 parses and indexes the content items and related information that the search provider 140 collects through the crawling process. The indexing component 160 generates an index that provides a structure for the content items and related information that allows for location and retrieval of the content items and related information. According to one embodiment of the invention, the indexing component 160 creates an index of word-location pairs that allows the search engine 150 to identify specific content items and information regarding the same in response to a query, which may be from a user, software component, automated process, etc. The one or more indexes that the indexing component 160 generates are written to an index data store 170 for persistent storage and use by other components of the search provider 140.
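An index of word-location pairs can be pictured with a small in-memory structure. The following is a minimal sketch only: the function names `build_index` and `lookup`, the whitespace tokenization and the sample documents are illustrative assumptions and are not taken from the specification.

```python
from collections import defaultdict


def build_index(documents):
    """Map each lower-cased word to a list of (doc_id, position) pairs."""
    index = defaultdict(list)
    for doc_id, text in documents.items():
        for position, word in enumerate(text.lower().split()):
            index[word].append((doc_id, position))
    return index


def lookup(index, query):
    """Return the ids of documents that contain every term of the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    # One set of document ids per query term, then intersect them.
    doc_sets = [{doc_id for doc_id, _ in index.get(term, [])} for term in terms]
    return set.intersection(*doc_sets)


# Hypothetical crawled content items.
docs = {
    "d1": "New York Yankees schedule",
    "d2": "New York weather",
    "d3": "Boston Red Sox",
}
index = build_index(docs)
```

Calling `lookup(index, "new york")` on this sample returns the identifiers of the two documents that contain both terms; a production index would also persist the structure to the index data store.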

A user of a given client device 110 who desires to retrieve a content item from a content provider 130 that is relevant to a particular topic, but who is unsure or ignorant regarding the address or location of the content item, submits a query to the search engine 150. According to one embodiment, a user utilizes a given client device 110 to connect over the network 120 to the search engine 150 at the search provider 140 and provide a query. A typical query has one or more terms. For example, the query “New York Yankees” contains three terms and is referred to as a three-term query. Similarly, queries containing only one term are referred to as one-term queries, queries containing two terms are two-term queries, etc. A space or other delimiter character that the search engine 150 comprehends may delimit individual terms comprising a given query.

Upon receipt of the query, the search engine 150 examines the index using the query terms in an attempt to identify a result set that comprises content items that are responsive to the query. The search engine 150 formulates the result set for transmission over the network 120 and presentation to the user through use of the client device 110. Where the result set comprises one or more links to content items, the user may select a link in the result set to navigate to the content provider 130 that is hosting the content item that the link identifies. The search engine 150 may utilize a persistent data store 190 for storage of an historical log of the queries that users submit, which may include an indication of the selection by users of items in result sets that the search engine 150 transmits to users.

As discussed previously, users become increasingly frustrated when presented with a result set that does not identify content items that are more relevant to a given query prior to less relevant content items. Accordingly, the present embodiment provides a ranking engine 180 that is operative to utilize machine learning to determine a function to compute the relevance of a given content item to a given query. The ranking engine 180 receives query-content item pairs and applies a ranking function, the selection of which is described in greater detail herein, to determine the rank of the content item vis-à-vis the query.

The ranking engine 180 utilizes a feature vector of a given query-content item pair in determining the appropriate ranking. A feature vector may be developed by extracting a set of features from a given query-content item pair, where a given feature may be represented as a quantification of an aspect of a relationship between a query and content item, which may include quantifying aspects of the query, the content item, or both. According to one embodiment, a feature vector consists of three vector components: (1) query-feature vector x^(Q), (2) document-feature vector x^(D) and (3) query-document feature vector x^(QD). Query-feature vector x^(Q) comprises features that depend only on the query q and have constant values across all documents d in a document set, for example, the number of terms in the query or whether or not the query is a person's name. Document-feature vector x^(D) comprises features that depend only on a given document d and have constant values across all the queries q in the query set, for example the number of inbound links pointing to the given document, the amount of anchor-text in bytes for the given document and the language identity of the given document. Query-document feature vector x^(QD) comprises features that depend on the relation of the query q with respect to a given document d, for example, the number of times each term in the query q appears in the given document d or the number of times each term in the query q appears in the anchor-texts of the given document d.
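The three-part feature vector can be sketched as a simple concatenation. The code below is illustrative only: the dictionary keys `text`, `inbound_links` and `anchor_text_bytes`, the helper names, and the decision to sum the per-term counts into a single number are assumptions made for brevity, not details given in the specification.

```python
def query_features(query):
    """x^(Q): depends only on the query (here, its number of terms)."""
    return [float(len(query.split()))]


def document_features(doc):
    """x^(D): depends only on the document."""
    return [float(doc["inbound_links"]), float(doc["anchor_text_bytes"])]


def query_document_features(query, doc):
    """x^(QD): depends on both (here, total occurrences of query terms)."""
    body = doc["text"].lower().split()
    return [float(sum(body.count(term) for term in query.lower().split()))]


def feature_vector(query, doc):
    """Concatenate x^(Q), x^(D) and x^(QD) into one feature vector."""
    return (query_features(query)
            + document_features(doc)
            + query_document_features(query, doc))


# Hypothetical document record.
doc = {
    "text": "Yankees win as Yankees rally in New York",
    "inbound_links": 12,
    "anchor_text_bytes": 340,
}
vector = feature_vector("New York Yankees", doc)
```

For this sample the query-only component has one entry, the document-only component two, and the query-document component one, so `vector` has four entries.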

A training set forms the basis for determining a ranking function that the ranking engine 180 utilizes to determine the ranking of content items responsive to a given query. The ranking engine 180 receives a training query and a set of content items for inclusion in a primary training set, which the ranking engine 180 may select. According to one embodiment, the ranking engine 180 presents content items from the training set to one or more human subjects for the assignment of a label indicating the relevance of content items in the training set to the query. For example, a five-level numerical grade (0, 1, 2, 3, or 4) is assigned to each query-document pair based on the degree of relevance, with a numerical grade 0 being least relevant and a numerical grade 4 being most relevant. Alternatively, the ranking engine 180 may access the persistent data store 190 to retrieve a past query (training query) and corresponding result set (primary training set), utilizing selection information from a user regarding the selection of items in the result set in response to the query to determine labels for the content items in the primary training set to the query.

The ranking engine 180 may develop one or more training subsets, a given training subset utilizing the label data and the feature vectors. For example, the ranking engine 180 may divide the primary training set into two training sets, where a first training subset comprises query document pairs that are assigned labels, L₁. According to one embodiment, the first training subset may encompass query document pairs that are assigned identical labels, such as all query document pairs that are assigned a label “2”. According to another embodiment, the first training subset may encompass all query document pairs that are assigned any label, such as all query-document pairs that are assigned any of the labels from the five-level numerical grade set.

The remaining query document pairs may then be placed into a second training subset, L₂. According to one embodiment, where all query document pairs assigned an identical label are placed in L₁, the second training subset may comprise all remaining query document pairs that are assigned varying labels. According to another embodiment, where all query document pairs that are assigned any labels are placed in L₁, the remaining query document pairs without an assigned label are placed in the second training subset. According to other embodiments, the second training subset may comprise any combination of query document pairs assigned a label or not assigned a label.

While the training subset L₁ is used to extract labeled data, the training subset L₂ may be used to generate a set of preference data based upon the feature vectors for one or more query document pairs by comparing the degree of relevancy between two content items with respect to a given query. For example, training subset L₂ may contain a given query q and two corresponding documents, d_(x) and d_(y). The feature vectors for the query document pairs (q, d_(x)) and (q, d_(y)) may be termed x and y, respectively. If d_(x) is more relevant to the query q than d_(y), the preference x>y may be established. The degree of relevancy of d_(x) to the query q as compared to d_(y) may be determined from labeled data, such as where d_(x) has a higher numerical label than d_(y) for the query q, or from click-through data, such as where selection information from users demonstrates that d_(x) is selected more than d_(y) for the query q. The ranking engine 180 may consider, for a given query in the respective training subsets, one or more pairs of documents within the search results in establishing preferences. Continuing with the present example, the ranking engine 180 may develop preference data from training subset L₂ only. On the basis of the label data and the preference data, the ranking engine 180 is operative to identify a ranking function, which the ranking engine 180 applies to determine the ranking of content items to a new query that the search engine 150 receives.
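Deriving preferences x>y from graded labels can be sketched as follows. This is a minimal illustration under stated assumptions: the input is a list of hypothetical `(query, feature_vector, grade)` tuples, only documents sharing a query are compared, and ties produce no preference. Preferences derived from click-through data would need a different comparison rule.

```python
from itertools import combinations


def preference_pairs(labeled):
    """Return (x, y) feature-vector pairs where x is preferred over y.

    `labeled` is a list of (query, feature_vector, grade) tuples.
    """
    by_query = {}
    for query, features, grade in labeled:
        by_query.setdefault(query, []).append((features, grade))

    pairs = []
    for items in by_query.values():
        # Compare every two documents retrieved for the same query.
        for (fa, ga), (fb, gb) in combinations(items, 2):
            if ga > gb:
                pairs.append((fa, fb))
            elif gb > ga:
                pairs.append((fb, fa))
            # Equal grades express no preference and are skipped.
    return pairs


# Hypothetical labeled examples with one-dimensional feature vectors.
examples = [
    ("q1", [1.0], 3),
    ("q1", [2.0], 1),
    ("q1", [3.0], 3),
    ("q2", [4.0], 0),
]
prefs = preference_pairs(examples)
```

The query with a single document contributes nothing, and the two equally graded documents of the first query are not paired with each other.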

According to one embodiment, the ranking function that the ranking engine 180 develops and applies may utilize a general boosting framework that extends functional gradient boosting to optimize complex loss functions, which is explained in greater detail herein. More specifically, the embodiment uses a regression method to optimize a loss function based on quadratic upper bounds. Those of skill in the art recognize that a loss function may determine the difference over the content items in the training subsets between a label assigned by a human (as discussed, these labels may also be determined by the other methods that reflect the human perception regarding the relevance of a given query-content item pair, e.g., click-through information from the persistent data store 190 that the ranking engine 180 utilizes) and the output of the ranking function. Accordingly, when the ranking engine 180 applies the trained ranking function, the ranking engine 180 may determine the ranking of content items related to a query that the search engine 150 is processing.

When the search engine 150 receives a query from a client device 110 that the user is utilizing, the search engine 150 queries the index in the index data store 170 to determine a result set that is responsive to the query. The search engine 150 may also pass the query and result set to the ranking engine 180. The ranking engine 180 applies the trained ranking function to the one or more content items that it receives from the search engine 150, determining the ranking for the content items in the result set. According to one embodiment, the ranking engine 180 may use the ranking function to develop a ranking score for each content item. The search engine 150 receives the ranking scores for the content items in the result set from the ranking engine 180 and may utilize the scores for ranking or ordering purposes, e.g., presenting the content items or links to content items with the highest ranking scores (more relevant) prior to the content items or links to content items in the result set with lesser ranking scores (less relevant). The search engine 150 transmits the ranked result set to the client device 110 for viewing by the user.

As introduced in FIG. 1, FIG. 2 illustrates one embodiment of a method for learning ranking functions that incorporate boosted ranking to determine the ranking of one or more content items that are responsive to a query. To understand the ranking function that is developed to determine the ranking of one or more content items that are responsive to a query, it is useful to formulate the learning problem under consideration. A general optimization problem may be modeled in the following equation:

$\hat{h} = \arg\min_{h \in \mathcal{H}} \mathcal{R}(h), \qquad \text{Equation 1}$

where h denotes a prediction function to be learned from the data, H is a pre-determined function class and R(h) is a risk functional with respect to h. The risk functional may be expressed in the following form:

$\mathcal{R}(h) = \frac{1}{n}\sum_{i=1}^{n} \varphi_{i}\left(h(x_{i,1}), \ldots, h(x_{i,m_{i}}), y_{i}\right), \qquad \text{Equation 2}$

where φ_(i)(h₁, . . . , h_(mi), y) is a general loss function with respect to the first m_(i) arguments h₁, . . . , h_(mi). The loss function φ_(i) can be a single variable function, where m_(i)=1, such as in a linear regression; or a two-variable function, where m_(i)=2, which may be used in pair-wise comparison; or a multi-variable function as used in certain structured prediction problems.

A solution to the optimization of the risk functional expressed in Equation 2 has been proposed in the art using the process of gradient boosting, for risk functionals of the following form:

$\mathcal{R}(h) = \sum_{i=1}^{n} \varphi_{i}\left(h(x_{i})\right), \qquad \text{Equation 3}$

where the gradient ∇φ_(i)(h(x_(i))) is estimated using a regression procedure at each step with uniform weighting. However, the process of gradient boosting does not provide for a convergence analysis. Therefore, according to one embodiment of the present invention, an extension of gradient boosting is utilized where a convergence analysis may be obtained, which may be expressed in the following equation:

$\mathcal{R}(h) = \mathcal{R}\left(h(x_{1}), \ldots, h(x_{N})\right), \qquad \text{Equation 4}$

where N≤Σm_(i), such that R depends on h only through the function values h(x_(i)), and the function h can be identified with the vector [h(x_(i))], making R a function of N variables.

Using a Taylor expansion, the risk functional may be expanded at each tentative solution h_(k) for the function h as expressed in the following equation:

$\mathcal{R}(h_{k} + g) = \mathcal{R}(h_{k}) + \nabla\mathcal{R}(h_{k})^{T} g + \frac{1}{2} g^{T} \nabla^{2}\mathcal{R}(h^{\prime})\, g, \qquad \text{Equation 5}$

where h′ lies between h_(k) and h_(k)+g. The right-hand side of Equation 5 resembles a quadratic in g, thus allowing for the replacement of the right-hand side with a quadratic upper bound:

$\mathcal{R}(h_{k} + g) \leq \mathcal{R}_{k}(g) = \mathcal{R}(h_{k}) + \nabla\mathcal{R}(h_{k})^{T} g + \frac{1}{2} g^{T} W g, \qquad \text{Equation 6}$

where W is a diagonal matrix providing an upper bound on the Hessian between h_(k) and h_(k)+g. If a variable is defined as τ_(j)=−[∇R(h_(k))]_(j)/w_(j), then for all g∈C, the weighted squared error Σ_(j)w_(j)(g(x_(j))−τ_(j))² agrees with the above quadratic form up to a positive scale factor and an additive constant, and therefore has the same minimizer. Accordingly, g may be found by calling a regression weak learner. Since at each step an attempt is made to minimize an upper bound R_(k) of R, if the minimizer is denoted g_(k), it becomes clear that R(h_(k)+g_(k))≤R_(k)(g_(k))≤R(h_(k)), so that by optimizing with respect to the problem R_(k), progress can be made with respect to optimizing R.
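The step from the quadratic upper bound to a weighted regression can be spelled out by completing the square. The following worked derivation is a sketch that assumes every diagonal weight w_(j) is strictly positive and writes g_j as shorthand for g(x_(j)):

```latex
\begin{aligned}
\sum_j w_j (g_j - \tau_j)^2
  &= \sum_j w_j g_j^2 \;-\; 2\sum_j w_j \tau_j g_j \;+\; \sum_j w_j \tau_j^2 \\
  &= g^{T} W g \;+\; 2\,\nabla\mathcal{R}(h_k)^{T} g \;+\; \sum_j w_j \tau_j^2
     \qquad \text{since } w_j \tau_j = -[\nabla\mathcal{R}(h_k)]_j \\
  &= 2\left[\mathcal{R}_k(g) - \mathcal{R}(h_k)\right] \;+\; \sum_j w_j \tau_j^2 .
\end{aligned}
```

Neither R(h_(k)) nor the last sum depends on g, so minimizing the weighted squared error over g also minimizes the upper bound R_(k)(g).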

The theory of utilizing quadratic approximation as an extension of gradient boosting is used to develop the following algorithm:

Algorithm 1 Greedy Algorithm with Quadratic Approximation
Input: X = [x_(e)]_(e=1,...,N)
let h₀ = 0
for k = 0, 1, 2, ...
  let W = [w_(e)]_(e=1,...,N), with either
    w_(e) = ∂²R/∂h_(k)(x_(e))², or          % Newton-type method with diagonal Hessian
    W a global diagonal upper bound on the Hessian          % Upper-bound minimization
  let R = [r_(e)]_(e=1,...,N), where r_(e) = −w_(e)⁻¹ ∂R/∂h_(k)(x_(e))
  pick ε_(k) ≥ 0
  let g_(k) = A(W, X, R, ε_(k))
  pick step-size s_(k) ≥ 0, typically by line search on R
  let h_(k+1) = h_(k) + s_(k)g_(k)
end

Here A denotes the regression weak learner, and the sign of r_(e) follows the definition of τ_(j) given above. Algorithm 1 may then be utilized to determine a ranking function that incorporates boosted ranking to determine the ranking of one or more content items that are responsive to a query.
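The greedy loop of Algorithm 1 can be illustrated in a few lines of Python. This is a sketch under several assumptions that are not part of the specification: inputs are scalars, the weak learner A is replaced by a weighted regression stump (`fit_stump`), the tolerance ε_(k) is ignored, and the line search simply tries a small fixed grid of step sizes. The squared-error risk used in the example is likewise chosen only for illustration.

```python
def fit_stump(xs, targets, weights):
    """Weighted least-squares regression stump on scalar inputs (stand-in for A)."""
    best = None
    for threshold in sorted(set(xs)):
        sums = {True: [0.0, 0.0], False: [0.0, 0.0]}
        for x, t, w in zip(xs, targets, weights):
            side = sums[x <= threshold]
            side[0] += w * t
            side[1] += w
        left = sums[True][0] / sums[True][1] if sums[True][1] else 0.0
        right = sums[False][0] / sums[False][1] if sums[False][1] else 0.0
        err = sum(w * (t - (left if x <= threshold else right)) ** 2
                  for x, t, w in zip(xs, targets, weights))
        if best is None or err < best[0]:
            best = (err, threshold, left, right)
    _, threshold, left, right = best
    return lambda x: left if x <= threshold else right


def greedy_quadratic_boost(xs, risk, gradient, hessian_diag, rounds=20):
    """Greedy boosting: fit g_k to -grad/w with weights w, then line search."""
    learners = []  # list of (step size s_k, weak learner g_k)

    def h(x):
        return sum(s * g(x) for s, g in learners)

    for _ in range(rounds):
        values = [h(x) for x in xs]
        weights = hessian_diag(values)              # w_e (assumed positive)
        grads = gradient(values)                    # dR/dh(x_e)
        targets = [-gr / w for gr, w in zip(grads, weights)]
        g = fit_stump(xs, targets, weights)
        g_vals = [g(x) for x in xs]
        # Crude line search over a fixed grid; 0.0 keeps the risk from rising.
        step = min([0.0, 0.25, 0.5, 1.0, 2.0],
                   key=lambda s: risk([v + s * gv for v, gv in zip(values, g_vals)]))
        if step == 0.0:
            break
        learners.append((step, g))
    return h


# Example risk: R(h) = 1/2 * sum (h(x_i) - y_i)^2, whose Hessian is the identity.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0.0, 0.0, 1.0, 1.0, 3.0, 3.0]
risk = lambda values: 0.5 * sum((v - y) ** 2 for v, y in zip(values, ys))
gradient = lambda values: [v - y for v, y in zip(values, ys)]
hessian_diag = lambda values: [1.0] * len(values)

h = greedy_quadratic_boost(xs, risk, gradient, hessian_diag)
final_risk = risk([h(x) for x in xs])
```

Because the step-size grid contains zero, each round can only lower the risk or leave it unchanged, mirroring the inequality R(h_(k)+g_(k))≤R(h_(k)).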

As FIG. 2 illustrates, the determination of a ranking function may initiate with the generation of one or more training sets, step 202. As discussed previously, a ranking engine may develop one or more training subsets from a primary training set utilizing both label data and feature vectors for query document pairs. The ranking engine may utilize the one or more training sets to develop a set of available preferences, step 204, which may be denoted as

S={x_(i)>y_(i), i=1, . . . , N}

In addition, the ranking engine may determine labeled data of the generated training set, step 206, which may be denoted as

L={(z_(i),l_(i)), i=1, . . . , n}

where z_(i) is the feature vector of a content item and l_(i) is the corresponding numerically coded label.

Using both the preference data and the labeled data, a risk functional may be formulated, step 208, such that the ranking function h that minimizes it satisfies the set of preferences, i.e., h(x_(i))≥h(y_(i)) if x_(i)>y_(i), i=1, . . . , N, while at the same time h(z_(i)) matches the label l_(i). So that Algorithm 1 may be used to fit both the preference data and the labeled data of the generated training set, the risk functional may be expressed as:

$\mathcal{R}(h) = \frac{w}{2}\sum_{i=1}^{N}\left(\max\left\{0,\; h(y_{i}) - h(x_{i}) + \tau\right\}\right)^{2} + \frac{1-w}{2}\sum_{i=1}^{n}\left(l_{i} - h(z_{i})\right)^{2}. \qquad \text{Equation 7}$

As Equation 7 illustrates, the risk functional consists of two parts: (1) with respect to preference data, a margin parameter τ is introduced to enforce that h(x_(i))≥h(y_(i))+τ, and if not the difference is quadratically penalized and (2) with respect to labeled data, the squared errors are minimized.
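The two parts of Equation 7 translate directly into code. The sketch below is illustrative: the function name `qbrank_risk`, the representation of preferences as `(x, y)` pairs and of labeled data as `(z, l)` pairs, and the toy scoring function are assumptions made for the example.

```python
def qbrank_risk(h, preferences, labeled, w=0.5, tau=1.0):
    """Evaluate the risk functional of Equation 7 for a scoring function h.

    preferences: list of (x, y) with x preferred over y
    labeled:     list of (z, l) with l the numeric label of z
    """
    # Part 1: squared hinge on preference pairs that violate the margin tau.
    pref_term = sum(max(0.0, h(y) - h(x) + tau) ** 2 for x, y in preferences)
    # Part 2: squared error against the numeric labels.
    label_term = sum((l - h(z)) ** 2 for z, l in labeled)
    return 0.5 * w * pref_term + 0.5 * (1.0 - w) * label_term


# Toy example with scalar features and a hypothetical scoring function.
score = lambda x: 2.0 * x
pairs = [(1.0, 0.0), (0.2, 0.5)]     # first pair satisfies the margin, second does not
labels = [(1.0, 2.0), (0.0, 1.0)]
value = qbrank_risk(score, pairs, labels, w=0.5, tau=1.0)
```

In this example the first preference pair is separated by more than the margin and contributes nothing, while the second pair is ordered the wrong way and is penalized quadratically.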

The parameter w is introduced as the relative weight for the preference data and may be found by cross-validation. The risk functional expressed in Equation 7 may be optimized using quadratic approximation, where

h(x_(i)), h(y_(i)), i=1, . . . , N, h(z_(i)), i=1, . . . , n

are considered unknowns and the gradient of R(h) is calculated with respect to those unknowns. In performing the calculation, and setting aside the weights w and 1−w of Equation 7, the component of the negative gradient corresponding to h(z_(i)) is determined to be l_(i)−h(z_(i)) and the components of the negative gradient corresponding to h(x_(i)) and h(y_(i)), respectively, are determined to be

max{0,h(y_(i))−h(x_(i))+τ}, −max{0,h(y_(i))−h(x_(i))+τ}.

When h(x_(i))−h(y_(i))≥τ, the components of the negative gradient corresponding to h(x_(i)) and h(y_(i)) equal zero. For the second-order term, it may be verified that the Hessian of R(h) is block diagonal with 2-by-2 blocks corresponding to h(x_(i)) and h(y_(i)), and 1-by-1 blocks for h(z_(i)). In particular, if the Hessian is evaluated at h, the 2-by-2 block equals

$\begin{bmatrix}1 & {- 1} \\{- 1} & 1\end{bmatrix},\begin{bmatrix}0 & 0 \\0 & 0\end{bmatrix}$

for x_(i)>y_(i) with h(x_(i))−h(y_(i))<τ and h(x_(i))−h(y_(i))≥τ, respectively. The first matrix may be upper bounded by a diagonal matrix (for example, twice the identity matrix, since the eigenvalues of the first matrix are 0 and 2), leading to a quadratic upper bound.
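The diagonal bound on the first 2-by-2 block can be checked by hand. The specification does not name the diagonal matrix; the short verification below assumes twice the identity matrix as one valid choice:

```latex
2I - \begin{bmatrix} 1 & -1 \\ -1 & 1 \end{bmatrix}
  = \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix},
\qquad
v^{T} \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix} v = (v_1 + v_2)^2 \;\ge\; 0
\quad \text{for all } v = (v_1, v_2)^{T} .
```

The difference is therefore positive semidefinite, so replacing the block with the diagonal matrix can only increase the quadratic form, which is what an upper bound of the kind used in Equation 6 requires.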

Utilizing the risk functional, a ranking function algorithm may be derived, step 210, expressed as follows:

Algorithm 2 Boosted Ranking using Successive Quadratic Approximation (QBRank)
Start with an initial guess h₀; for m = 1, 2, ...,
  1) Construct a training set for fitting g_(m)(x) by adding the following for each (x_(i), y_(i)) ∈ S,
       (x_(i), max{0, h_(m−1)(y_(i)) − h_(m−1)(x_(i)) + τ}), (y_(i), −max{0, h_(m−1)(y_(i)) − h_(m−1)(x_(i)) + τ}),
     and
       {(z_(i), l_(i) − h_(m−1)(z_(i))), i = 1, ..., n}.
     The fitting of g_(m)(x) is done by using a base regressor with the above training set; the preference data are weighted by w and the labeled data by 1 − w, respectively.
  2) Form h_(m) = h_(m−1) + ηs_(m)g_(m)(x), where s_(m) is found by line search to minimize the objective function and η is a shrinkage factor.

The output of the ranking function algorithm is the ranking function h, step 212, which the algorithm expresses as h_(m)=h_(m−1)+ηs_(m)g_(m)(x), where η is a shrinkage factor that is determined by cross-validation and s_(m) is representative of a line search strategy incorporated into the ranking function to minimize the objective function. Upon determining the ranking function, the ranking engine stores the ranking function for use in determining the ranking of query-content item pairs not contained in the training set, step 214.
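The two steps of Algorithm 2 can be sketched end to end on toy data. Everything below that is not named in the algorithm is an assumption made for illustration: the base regressor is a weighted regression stump (`fit_stump`), the initial guess h₀ is zero, the line search tries a small fixed grid of step sizes, and the default values of w, τ and η are arbitrary rather than cross-validated.

```python
def fit_stump(points, targets, weights):
    """Weighted regression stump over list-valued features (stand-in base regressor)."""
    best = None
    for j in range(len(points[0])):
        for threshold in sorted({p[j] for p in points}):
            sums = {True: [0.0, 0.0], False: [0.0, 0.0]}
            for p, t, w in zip(points, targets, weights):
                side = sums[p[j] <= threshold]
                side[0] += w * t
                side[1] += w
            left = sums[True][0] / sums[True][1] if sums[True][1] else 0.0
            right = sums[False][0] / sums[False][1] if sums[False][1] else 0.0
            err = sum(w * (t - (left if p[j] <= threshold else right)) ** 2
                      for p, t, w in zip(points, targets, weights))
            if best is None or err < best[0]:
                best = (err, j, threshold, left, right)
    _, j, threshold, left, right = best
    return lambda p: left if p[j] <= threshold else right


def qbrank_risk(h, prefs, labeled, w, tau):
    """Risk functional of Equation 7."""
    pref_term = sum(max(0.0, h(y) - h(x) + tau) ** 2 for x, y in prefs)
    label_term = sum((l - h(z)) ** 2 for z, l in labeled)
    return 0.5 * w * pref_term + 0.5 * (1.0 - w) * label_term


def qbrank(prefs, labeled, w=0.5, tau=1.0, eta=0.5, rounds=30):
    """Boosted ranking by successive quadratic approximation (sketch)."""
    learners = []  # list of (eta * s_m, g_m)

    def h(p):
        return sum(c * g(p) for c, g in learners)

    for _ in range(rounds):
        points, targets, weights = [], [], []
        # Step 1: build the regression training set from the current h.
        for x, y in prefs:
            gap = max(0.0, h(y) - h(x) + tau)
            points += [x, y]
            targets += [gap, -gap]
            weights += [w, w]
        for z, l in labeled:
            points.append(z)
            targets.append(l - h(z))
            weights.append(1.0 - w)
        g = fit_stump(points, targets, weights)

        # Step 2: grid line search for s_m, then a shrunken update.
        def risk_at(step):
            return qbrank_risk(lambda p: h(p) + step * g(p), prefs, labeled, w, tau)

        step = min([0.0, 0.25, 0.5, 1.0, 2.0, 4.0], key=risk_at)
        if step == 0.0:
            break
        learners.append((eta * step, g))
    return h


# Toy data: one-dimensional feature vectors.
prefs = [([2.0], [1.0]), ([3.0], [0.0]), ([1.0], [0.0])]
labeled = [([0.0], 0.0), ([3.0], 3.0)]
h = qbrank(prefs, labeled)
```

Because the risk of Equation 7 is convex along the search direction and the grid includes a zero step, applying the shrunken step ηs_(m) with 0<η≤1 never raises the risk from one round to the next in this sketch.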

FIG. 3 presents one embodiment of a method for using the ranking function to order a result set that the search engine deems responsive to a given query. A user submits a query to the search engine, step 302, which causes the search engine to retrieve a set of content items that are to be ranked according to the relevance of the content items to the query, step 304. In one embodiment, only content including or associated with one or more terms in the query is included in the result set, for example, content items that contain user supplied tags that contain the terms. In another embodiment, the search engine may utilize other criteria to select content for inclusion in the result set.

According to one embodiment, the trained ranking function is used to determine a ranking score for a given content item as paired with the query, step 306. The ranking function receives a feature vector as well as labeled data for a given content item as input and provides a ranking score. A check is performed to determine if additional content items exist to which the ranking function is to be applied, step 308. Processing continues until a ranking score has been calculated for each of the content items in the result set, step 306. The search engine orders the result set according to the ranking score associated with each content item in the result set, step 310. The search engine transmits the ordered result set to the client device for presentation to the user, step 312.
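The scoring and ordering steps of FIG. 3 can be sketched as a short function. The names `rank_result_set` and `featurize`, the toy feature extractor and the toy ranking function are hypothetical stand-ins; in practice the learned ranking function and the feature extraction of the ranking engine would be supplied instead.

```python
def rank_result_set(query, result_set, ranking_function, featurize):
    """Score every content item for the query and return the items best-first."""
    scored = [(ranking_function(featurize(query, item)), item)
              for item in result_set]
    # Higher ranking scores (more relevant) are presented first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [item for _, item in scored]


# Toy stand-ins: one feature counting query-term occurrences, scored as-is.
def featurize(query, item):
    words = item.lower().split()
    return [float(sum(words.count(term) for term in query.lower().split()))]


def toy_ranking_function(features):
    return features[0]


results = ["Boston weather", "New York Yankees news", "New York weather"]
ordered = rank_result_set("new york yankees", results,
                          toy_ranking_function, featurize)
```

With these stand-ins, the item matching all three query terms is placed first and the item matching none is placed last.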

In accordance with the foregoing description, the present inventionprovides for systems, methods and computer program products for learningranking functions that incorporate boosted ranking to determine theranking of one or more content items that are responsive to a query. Inlearning ranking functions that incorporate boosted ranking to determinethe ranking of one or more content items that are responsive to a query,the present invention allows for an effective and efficient functionwhich retrieves and efficiently ranks content items.

FIGS. 1 through 3 are conceptual illustrations allowing for anexplanation of the present invention. It should be understood thatvarious aspects of the embodiments of the present invention could beimplemented in hardware, firmware, software, or combinations thereof. Insuch embodiments, the various components and/or steps would beimplemented in hardware, firmware, and/or software to perform thefunctions of the present invention. That is, the same piece of hardware,firmware, or module of software could perform one or more of theillustrated blocks (e.g., components or steps).

In software implementations, computer software (e.g., programs or other instructions) and/or data is stored on a machine readable medium as part of a computer program product, and is loaded into a computer system or other device or machine via a removable storage drive, hard drive, or communications interface. Computer programs (also called computer control logic or computer readable program code) are stored in a main and/or secondary memory, and executed by one or more processors (controllers, or the like) to cause the one or more processors to perform the functions of the invention as described herein. In this document, the terms “machine readable medium,” “computer program medium” and “computer usable medium” are used to generally refer to media such as a random access memory (RAM); a read only memory (ROM); a removable storage unit (e.g., a magnetic or optical disc, flash memory device, or the like); a hard disk; electronic, electromagnetic, optical, acoustical, or other form of propagated signals (e.g., carrier waves, infrared signals, digital signals, etc.); or the like.

Notably, the figures and examples above are not meant to limit the scope of the present invention to a single embodiment, as other embodiments are possible by way of interchange of some or all of the described or illustrated elements. Moreover, where certain elements of the present invention can be partially or fully implemented using known components, only those portions of such known components that are necessary for an understanding of the present invention are described, and detailed descriptions of other portions of such known components are omitted so as not to obscure the invention. In the present specification, an embodiment showing a singular component should not necessarily be limited to other embodiments including a plurality of the same component, and vice-versa, unless explicitly stated otherwise herein. Moreover, applicants do not intend for any term in the specification or claims to be ascribed an uncommon or special meaning unless explicitly set forth as such. Further, the present invention encompasses present and future known equivalents to the known components referred to herein by way of illustration.

The foregoing description of the specific embodiments will so fully reveal the general nature of the invention that others can, by applying knowledge within the skill of the relevant art(s) (including the contents of the documents cited and incorporated by reference herein), readily modify and/or adapt for various applications such specific embodiments, without undue experimentation, without departing from the general concept of the present invention. Such adaptations and modifications are therefore intended to be within the meaning and range of equivalents of the disclosed embodiments, based on the teaching and guidance presented herein. It is to be understood that the phraseology or terminology herein is for the purpose of description and not of limitation, such that the terminology or phraseology of the present specification is to be interpreted by the skilled artisan in light of the teachings and guidance presented herein, in combination with the knowledge of one skilled in the relevant art(s).

While various embodiments of the present invention have been described above, it should be understood that they have been presented by way of example, and not limitation. It would be apparent to one skilled in the relevant art(s) that various changes in form and detail could be made therein without departing from the spirit and scope of the invention. Thus, the present invention should not be limited by any of the above-described exemplary embodiments, but should be defined only in accordance with the following claims and their equivalents.

1. A method for learning ranking functions to determine the ranking of one or more content items that are responsive to a query, the method comprising: generating one or more training sets comprising one or more content item-query pairs; determining preference data for the one or more query-content item pairs of the one or more training sets; determining labeled data for the one or more query-content item pairs of the one or more training sets; determining a ranking function based upon the preference data and the labeled data for the one or more content-item query pairs of the one or more training sets; and storing the ranking function for application to query-content item pairs not contained in the one or more training sets.
2. The method of claim 1, wherein determining a ranking function based upon the preference data and the labeled data for the one or more content-item query pairs of the one or more training sets further comprises determining a ranking function by deriving a boosted ranking function algorithm using a regression method based upon quadratic approximation.
3. The method of claim 1 comprising utilizing one or more regularization parameters to control parameters of the ranking function.
4. The method of claim 1 wherein generating one or more training sets comprising one or more query-content item pairs further comprises extracting query dependent features for the one or more content item-query pairs.
5. The method of claim 1 wherein generating one or more training sets comprising one or more query-content item pairs further comprises extracting query independent features for the one or more content item-query pairs.
6. The method of claim 1 wherein generating one or more training sets comprising one or more query-content item pairs further comprises providing a given query-content item pair to a human subject to determine a relevance label.
7. The method of claim 1 wherein generating one or more training sets comprising one or more query-content item pairs further comprises determining a relevance label based on historical click through data for a given content item-query pair.
8. The method of claim 1 further comprising applying the ranking function to a new content item in response to receipt of a query from a user.
9. The method of claim 8 wherein applying the ranking function to a new content item in response to receipt of a query from a user, further comprises: retrieving one or more content items in a result set in response to receipt of the query from the user; determining a feature vector for a given content item in the result set; applying the ranking function to the feature vector for the given content item; and generating a ranking score for the given content item.
10. The method of claim 9 further comprising ordering the given item in the result set according to the ranking score for the given content item.
11. The method of claim 10 further comprising transmitting the result set to the user.
12. Computer readable media comprising program code that when executed by a programmable processor causes execution of a method for learning ranking functions to determine the ranking of one or more content items that are responsive to a query, the computer readable media comprising: program code for generating one or more training sets comprising one or more content item-query pairs; program code determining preference data for the one or more query-content item pairs of the one or more training sets; program code determining labeled data for the one or more query-content item pairs of the one or more training sets; program code determining a ranking function based upon the preference data and the labeled data for the one or more content-item query pairs of the one or more training sets; and program code storing the ranking function for application to query-content item pairs not contained in the one or more training sets.
13. The computer readable media of claim 12, wherein program code for determining a ranking function based upon the preference data and the labeled data for the one or more content-item query pairs of the one or more training sets further comprises program code for determining a ranking function by deriving a boosted ranking function algorithm using a regression method based upon quadratic approximation.
14. The computer readable media of claim 12 comprising utilizing one or more regularization parameters to control parameters of the ranking function.
15. The computer readable media of claim 12 wherein program code for generating one or more training sets comprising one or more query-content item pairs further comprises program code for extracting query dependent features for the one or more content item-query pairs.
16. The computer readable media of claim 12 wherein program code for generating one or more training sets comprising one or more query-content item pairs further comprises program code for extracting query independent features for the one or more content item-query pairs.
17. The computer readable media of claim 12 wherein program code for generating one or more training sets comprising one or more query-content item pairs further comprises program code for providing a given query-content item pair to a human subject to determine a relevance label.
18. The computer readable media of claim 12 wherein program code for generating one or more training sets comprising one or more query-content item pairs further comprises program code for determining a relevance label based on historical click through data for a given content item-query pair.
19. The computer readable media of claim 12 further comprising program code for applying the ranking function to a new content item in response to receipt of a query from a user.
20. The computer readable media of claim 19 wherein program code for applying the ranking function to a new content item in response to receipt of a query from a user, further comprises: program code for retrieving one or more content items in a result set in response to receipt of the query from the user; program code for determining a feature vector for a given content item in the result set; program code for applying the ranking function to the feature vector for the given content item; and program code for generating a ranking score for the given content item.
21. The computer readable media of claim 20 further comprising program code for ordering the given item in the result set according to the ranking score for the given content item.
22. The computer readable media of claim 21 further comprising program code for transmitting the result set to the user.
23. A system for learning ranking functions to determine the ranking of one or more content items that are responsive to a query, the system comprising: a computer network; a search engine operative to receive a search query comprising one or more terms from a user and to locate and retrieve one or more content items and related information responsive to the search query; an indexing component operative to parse the search query into one or more constituent terms and generate an index that defines a structure for the content items and related information that allows for location and retrieval of the content items and related information; an index data store operative to store the one or more indexes generated by the indexing component; a persistent data store operative to store a historical log of queries that users submit; and a ranking engine operative to, generate one or more training sets comprising one or more content item-query pairs, determine preference data for the one or more query-content item pairs of the one or more training sets, determine labeled data for the one or more query-content item pairs of the one or more training sets, determine a ranking function based upon the preference data and the labeled data for the one or more content-item query pairs of the one or more training sets, and store the ranking function for application to query-content item pairs not contained in the one or more training sets.
24. The system of claim 23, wherein the ranking engine is operative to determine a ranking function by deriving a boosted ranking function algorithm using a regression method based upon quadratic approximation.
25. The system of claim 23, wherein the ranking engine is operative to utilize one or more regularization parameters to control parameters of the ranking function.
26. The system of claim 23, wherein the ranking engine is operative to extract query dependent features for the one or more content item-query pairs.
27. The system of claim 23, wherein the ranking engine is operative to extract query independent features for the one or more content item-query pairs.
28. The system of claim 23, wherein the ranking engine is operative to provide a given query-content item pair to a human subject to determine a relevance label.
29. The system of claim 23, wherein the ranking engine is operative to determine a relevance label based on a historical log of click through data for a given content item-query pair.
30. The system of claim 23, wherein the ranking engine is operative to apply the ranking function to a new content item in response to receipt of a query from a user.
31. The system of claim 30, wherein the search engine is operative to retrieve one or more content items in a result set in response to receipt of the query from the user; and the ranking engine is operative to determine a feature vector for a given content item in the result set, apply the ranking function to the feature vector for the given content item, and generate a ranking score for the given content item.
32. The system of claim 31, wherein the search engine is operative to order the given item in the result set according to the ranking score for the given content item.
33. The system of claim 32, wherein the search engine is operative to transmit the result set to the user.